### CS152: Computer Systems Architecture The Hardware/Software Interface



Sang-Woo Jun Winter 2021



Large amount of material adapted from MIT 6.004, "Computation Structures", Morgan Kaufmann "Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition", and CS 152 Slides by Isaac Scherson

### Course outline

- Part 1: The Hardware-Software Interface
  - What makes a 'good' processor?
  - Assembly programming and conventions
- Part 2: Recap of digital design
  - Combinational and sequential circuits
  - How their restrictions influence processor design
- Part 3: Computer Architecture
  - Computer Arithmetic
  - Simple and pipelined processors
  - Caches and the memory hierarchy
- Part 4: Computer Systems
  - Operating systems, Virtual memory

# Eight great ideas

- Design for Moore's Law
- Use abstraction to simplify design
- Make the common case fast
- Performance via parallelism
- Performance via pipelining
- Performance via prediction
- Hierarchy of memories
- Dependability via redundancy














### Great idea: Use abstraction to simplify design

- Abstraction helps us deal with complexity by hiding lower-level detail
  - One of the most fundamental tools in computer science!
  - Examples:
    - Application Programming Interface (API)
    - System calls
    - Application Binary Interface (ABI)
    - Instruction-Set Architecture

# Below your program

- Application software
  - Written in a high-level language (HLL), typically
- System software
  - Compiler: translates HLL code to machine code
  - Operating System: service code
    - Handling input/output
    - Managing memory and storage
    - Scheduling tasks & sharing resources
- Hardware
  - Processor, memory, I/O controllers



### The Instruction Set Architecture

- An Instruction-Set Architecture (ISA) is the abstraction between the software and processor hardware
  - The 'Hardware/Software Interface'
  - Different from 'Microarchitecture', which is how the ISA is implemented
- A consistent ISA allows software to run on different machines of the same architecture
  - e.g., x86 across Intel and AMD processors, at various speed and power ratings

# Levels of program code

- High-level language
  - Level of abstraction closer to problem domain
  - Provides for productivity and portability
- Assembly language
  - Textual representation of instructions
- Hardware representation
  - Binary digits (bits)
  - Encoded instructions and data

The Instruction Set Architecture (ISA) is the agreement on what the binary program will do.

High-level language program (in C; 64-bit elements, matching the `ld`/`sd` instructions below):

    void swap(long long v[], long long k)
    {
        long long temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

↓ Compiler

Assembly language program (for RISC-V):

    swap:
        slli x6, x11, 3
        add  x6, x10, x6
        ld   x5, 0(x6)
        ld   x7, 8(x6)
        sd   x7, 0(x6)
        sd   x5, 8(x6)
        jalr x0, 0(x1)

↓ Assembler

Binary machine language program (for RISC-V), first three instructions:

    00000000001101011001001100010011
    00000000011001010000001100110011
    00000000000000110011001010000011

### A RISC-V Example ("00A9 8933")

- This four-byte binary value instructs a RISC-V CPU to
  - add the values in registers x19 and x10, and store the result in x18
  - regardless of processor speed, internal implementation, or chip designer

| Bits    | 31:25   | 24:20  | 19:15  | 14:12  | 11:7  | 6:0        |
|---------|---------|--------|--------|--------|-------|------------|
| Field   | funct7  | rs2    | rs1    | funct3 | rd    | opcode     |
| Width   | 7       | 5      | 5      | 3      | 5     | 7          |
| Value   | 0000000 | 01010  | 10011  | 000    | 10010 | 0110011    |
| Meaning | ADD     | rs2=10 | rs1=19 | ADD    | rd=18 | Reg-Reg OP |

#### add x18,x19,x10

Source: Yuanqing Cheng, "Great Ideas in Computer Architecture RISC-V Instruction Formats"
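The field layout above can be checked by packing the fields into one 32-bit word. This is a minimal sketch; the helper name `encode_r_type` is made up for illustration and is not part of any RISC-V toolchain.

```python
def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack the six R-type fields into one 32-bit instruction word."""
    return (
        (funct7 << 25)   # bits 31:25
        | (rs2 << 20)    # bits 24:20
        | (rs1 << 15)    # bits 19:15
        | (funct3 << 12) # bits 14:12
        | (rd << 7)      # bits 11:7
        | opcode         # bits 6:0
    )


# add x18, x19, x10: funct7 = 0000000, funct3 = 000, opcode = 0110011
word = encode_r_type(funct7=0b0000000, rs2=10, rs1=19, funct3=0b000, rd=18, opcode=0b0110011)
print(f"{word:08X}")  # prints 00A98933
```

The printed value matches the "00A9 8933" in the slide title.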

### Some history of ISA

- Early mainframes did not have a concept of ISAs (early 1960s)
  - Each new system had a different hardware-software interface
  - Software for each machine needed to be re-built
- IBM System/360 (1964) introduced the concept of ISAs
  - Same ISA shared across five different processor designs (at various costs!)
  - The same OS and software can run on all of them
  - Extremely successful!
- Aside: Intel x86 architecture introduced in 1978
  - Strict backwards compatibility maintained even now (the A20 line...)
  - Clean-slate redesigns were attempted multiple times but failed (iAPX 432, EPIC, ...)

### IBM System/360 Model 20 CPU



Source: Ben Franske, Wikipedia

### CS152: Computer Systems Architecture What Makes a "Good" ISA?




## What makes a 'good' ISA?

- Computer architecture is a complicated art...
  - No one design method leads to a 'best' computer
  - Subject to workloads, use patterns, criteria, operating environment, ...
- Important criteria: Given the same restrictions,
  - High performance!
  - Power efficiency
  - Low cost
  - ...
- May depend on target applications
  - E.g., Apple knows (and cares) more about its software than Intel does

### What does it mean to be high-performance?

- In the 90s, CPUs competed on clock speed
  - "My 166 MHz processor was faster than your 100 MHz processor!"
  - Not very representative across different architectures
  - A 2 GHz processor may require 5 instructions to do what a 1 GHz one does in only 2
- Let's define performance = 1 / execution time
- Example: time taken to run a program
  - 10 s on A, 15 s on B
  - $\text{Execution Time}_B / \text{Execution Time}_A = 15\,\text{s} / 10\,\text{s} = 1.5$
  - So A is 1.5 times faster than B

In general, X is $n$ times faster than Y when

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
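As a quick check of the definition, the ratio can be computed directly. This is a minimal sketch; the function name `times_faster` is made up for illustration.

```python
def times_faster(time_x_s: float, time_y_s: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is defined as 1 / execution time, so
    Performance_X / Performance_Y simplifies to time_Y / time_X.
    """
    if time_x_s <= 0 or time_y_s <= 0:
        raise ValueError("execution times must be positive")
    return time_y_s / time_x_s


# The example from the slide: 10 s on A, 15 s on B.
n = times_faster(time_x_s=10.0, time_y_s=15.0)
print(n)  # prints 1.5
```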

## Measuring execution time

#### Elapsed time

- Total response time, including all aspects
  - Processing, I/O, OS overhead, idle time
- Determines system performance

#### CPU time (Focus here for now)

- Time spent processing a given job
  - Discounts I/O time, other jobs' shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by CPU and system performance

# CPU clocking

Operation of digital hardware governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g., 250 ps = 0.25 ns = $250 \times 10^{-12}$ s
- Clock frequency (rate): cycles per second
  - e.g., 4.0 GHz = 4000 MHz = $4.0 \times 10^{9}$ Hz
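Period and rate are reciprocals, so the two example values above describe the same clock. A minimal sketch of the conversion (the function names are made up for illustration):

```python
PS_PER_SECOND = 1e12  # picoseconds in one second


def clock_rate_hz(period_ps: float) -> float:
    """Clock rate in Hz for a given clock period in picoseconds."""
    return PS_PER_SECOND / period_ps


def clock_period_ps(rate_hz: float) -> float:
    """Clock period in picoseconds for a given clock rate in Hz."""
    return PS_PER_SECOND / rate_hz


print(clock_rate_hz(250.0) / 1e9)  # prints 4.0 (GHz)
print(clock_period_ps(4.0e9))      # prints 250.0 (ps)
```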

### CPU time

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count

$$\begin{aligned} \text{CPU Time} &= \text{CPU Clock Cycles} \times \text{Clock Cycle Time} \\ &= \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}} \end{aligned}$$
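To make the clock-rate versus cycle-count trade-off concrete, here is a small worked sketch. The machines and all numbers are hypothetical, chosen only to illustrate the formula: machine A runs a program in 10 s at 2 GHz, and a redesign B raises the clock rate at the cost of 1.2 times as many cycles, aiming for 6 s.

```python
def cpu_time_s(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


def cycles_for(cpu_time: float, clock_rate_hz: float) -> float:
    """Clock cycles = CPU time x clock rate (the same formula rearranged)."""
    return cpu_time * clock_rate_hz


# Hypothetical machine A: 2 GHz clock, 10 s of CPU time for some program.
cycles_a = cycles_for(10.0, 2.0e9)

# Hypothetical redesign B: needs 1.2x as many cycles, target CPU time is 6 s.
cycles_b = 1.2 * cycles_a
required_rate_b = cycles_b / 6.0

print(f"B needs about {required_rate_b / 1e9:.1f} GHz")  # prints "B needs about 4.0 GHz"
```

With these assumed numbers, B has to double the clock rate to cut CPU time by 40%, because part of the gain is spent on the extra cycles.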

### Instruction count and CPI

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction (CPI)
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

$$\begin{aligned} \text{Clock Cycles} &= \text{Instruction Count} \times \text{Cycles per Instruction} \\ \text{CPU Time} &= \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} \\ &= \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \end{aligned}$$

### CPI example

- Computer A: Cycle Time = 250 ps, CPI = 2.0
- Computer B: Cycle Time = 500 ps, CPI = 1.2
- Same ISA, so the same program has the same instruction count $I$ on both

$$\begin{aligned} \text{CPU Time}_A &= \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A \\ &= I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps} \\ \text{CPU Time}_B &= \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B \\ &= I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps} \end{aligned}$$

A is faster, by a factor of 600 ps / 500 ps = 1.2.
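The comparison can be reproduced numerically. This is a minimal sketch; the instruction count of one million is an arbitrary assumption, since any positive count gives the same ratio.

```python
def cpu_time_ps(instruction_count: int, cpi: float, cycle_time_ps: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time (in picoseconds)."""
    return instruction_count * cpi * cycle_time_ps


INSTRUCTIONS = 1_000_000  # assumed; same ISA and program, so the same count on A and B

time_a = cpu_time_ps(INSTRUCTIONS, cpi=2.0, cycle_time_ps=250.0)
time_b = cpu_time_ps(INSTRUCTIONS, cpi=1.2, cycle_time_ps=500.0)

faster = "A" if time_a < time_b else "B"
ratio = max(time_a, time_b) / min(time_a, time_b)
print(f"{faster} is faster by a factor of {ratio:.2f}")  # prints "A is faster by a factor of 1.20"
```

Note that A wins despite its higher CPI, because its cycle time is half of B's.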

### CPI in more detail

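- If different instruction classes take different numbers of cycles

$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$

Note: not always true with microarchitectural tricks (pipelining, superscalar, ...)

Weighted average CPI:

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$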
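A small sketch of the weighted average, using three hypothetical instruction classes with CPI 1, 2 and 3 and two made-up instruction mixes. The class names and counts are illustrative assumptions, not measurements.

```python
def total_cycles(cpi_by_class: dict, count_by_class: dict) -> int:
    """Clock cycles = sum over classes of CPI_i x instruction count_i."""
    return sum(cpi_by_class[c] * count_by_class[c] for c in count_by_class)


def average_cpi(cpi_by_class: dict, count_by_class: dict) -> float:
    """Weighted average CPI = total clock cycles / total instruction count."""
    total_instructions = sum(count_by_class.values())
    return total_cycles(cpi_by_class, count_by_class) / total_instructions


CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}  # hypothetical cycles per instruction class

sequence_1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
sequence_2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

print(total_cycles(CPI_BY_CLASS, sequence_1), average_cpi(CPI_BY_CLASS, sequence_1))  # prints 10 2.0
print(total_cycles(CPI_BY_CLASS, sequence_2), average_cpi(CPI_BY_CLASS, sequence_2))  # prints 9 1.5
```

In this example the second sequence executes more instructions yet needs fewer cycles, because its mix leans on the cheap class.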



### Performance summary

#### Performance depends on

- Algorithm: affects Instruction count, (possibly CPI)
- Programming language: affects Instruction count, (possibly CPI)
- Compiler: affects Instruction count, CPI
- Instruction set architecture: affects Instruction count, CPI, Clock speed

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

A good ISA: Low instruction count, Low CPI, High clock speed

# Some goals for a good ISA

| Low instruction count                | Low CPI, high clock speed          |
|--------------------------------------|------------------------------------|
| Each instruction should do more work | Each instruction should be simpler |

How do we reconcile?
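One way to see the tension is to plug numbers into CPU time = instruction count × CPI / clock rate. The sketch below uses two entirely made-up designs; the figures are assumptions picked to illustrate the trade-off, not measurements of any real processor, and different numbers could tip the result the other way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    instruction_count: float  # instructions executed for the program
    cpi: float                # average cycles per instruction
    clock_rate_hz: float      # clock rate in Hz

    def cpu_time_s(self) -> float:
        """CPU time = instruction count x CPI / clock rate."""
        return self.instruction_count * self.cpi / self.clock_rate_hz


# Hypothetical designs: fewer, heavier instructions vs. more, simpler ones.
designs = [
    Design("complex instructions", instruction_count=6.0e8, cpi=3.0, clock_rate_hz=2.0e9),
    Design("simple instructions", instruction_count=1.0e9, cpi=1.2, clock_rate_hz=3.0e9),
]

for d in designs:
    print(f"{d.name}: {d.cpu_time_s():.2f} s")

best = min(designs, key=Design.cpu_time_s)
print(f"fastest with these assumptions: {best.name}")
```

What matters is the product of the three factors; no single factor decides the outcome.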

### Real-world examples: Intel i7 and ARM Cortex-A53





CPI of ARM Cortex-A53 on SPEC2006 Benchmarks

CPI of Intel i7 920 on SPEC2006 Benchmarks

### CS152: Computer Systems Architecture Some ISA Classifications




# Eight great ideas

- Design for Moore's Law
- Use abstraction to simplify design (today)
- Make the common case fast
- Performance via parallelism
- Performance via pipelining
- Performance via prediction
- Hierarchy of memories
- Dependability via redundancy



### The RISC/CISC Classification

- Reduced Instruction-Set Computer (RISC)
  - Precise definition is debated
  - Small number of more general instructions
    - RISC-V base instruction set has only dozens of instructions
    - Memory loads/stores not mixed with computation (separate instructions load from memory and compute on registers)
    - Often fixed-width encoding (4 bytes for base RISC-V)
  - Complex operations implemented by composing general ones
    - Compilers try their best!
  - Examples: RISC-V, ARM (Advanced RISC Machines), MIPS (Microprocessor without Interlocked Pipelined Stages), SPARC, ...

### The RISC/CISC Classification

- Complex Instruction-Set Computer (CISC)
  - Precise definition is debated (Not RISC?)
  - Many, complex instructions
    - Various memory access modes per instruction (operand from memory? from a register? etc.)
    - Typically variable-length encoding per instruction
    - Modern x86 has thousands!
  - Examples: Intel x86, IBM z/Architecture, ...

### The RISC/CISC Classification

- RISC paradigm is winning out
  - Simpler design allows faster clock
  - Simpler design allows efficient microarchitectural techniques
    - Superscalar, Out-of-order, ...
  - Compilers very good at optimizing software
- Most modern CISC processors have RISC internals
  - CISC instructions translated on-the-fly to RISC by the front-end hardware
  - Added overhead from translation (silicon, power, performance, ...)
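The translation idea can be sketched with a toy model. Everything here is hypothetical: the instruction tuples, the `ADD_MEM` opcode and the `tmp` register are invented for illustration and do not correspond to x86 or to any real front-end.

```python
def translate(instruction: tuple) -> list:
    """Expand one toy 'CISC' instruction into RISC-like micro-operations."""
    op = instruction[0]
    if op == "ADD_MEM":
        # ADD_MEM base, src means: mem[base] = mem[base] + src.
        # A load/store design needs three separate steps for this.
        _, base, src = instruction
        return [
            ("LOAD", "tmp", base),       # tmp = mem[base]
            ("ADD", "tmp", "tmp", src),  # tmp = tmp + src
            ("STORE", "tmp", base),      # mem[base] = tmp
        ]
    # Register-only instructions already look like a single micro-operation.
    return [instruction]


program = [
    ("ADD", "r1", "r2", "r3"),  # register-register add
    ("ADD_MEM", "r4", "r1"),    # add with a memory operand
]

micro_ops = [uop for instruction in program for uop in translate(instruction)]
for uop in micro_ops:
    print(uop)
```

Two toy instructions become four micro-operations here. In real hardware, doing this expansion on every instruction is part of the silicon and power overhead mentioned above.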